Lesson 6. Spatial Queries

Spatial analysis is a process that begins with exploring and mapping a dataset and can lead to potentially complex models and visualizations of real world features and phenomena. Spatial queries are the building blocks of this analytical process. These queries are software operations that allow us to ask questions of our data and which return data metrics, subsets or new data objects. In this lesson we explore the two basic types of spatial queries: measurement queries and relationship queries.


Instructor Notes


Types of Spatial Queries

The basic types of spatial queries are:

  • Measurement queries
    • What is feature A’s length?
    • What is feature A’s perimeter?
    • What is feature A’s area?
    • What is feature A’s distance from feature B?
    • etc.
  • Relationship queries
    • Does feature A intersect with feature B?
    • Is feature A within feature B?
    • Does feature A cross feature B?
    • etc.

Both of these types of queries operate on the geometry of features in one or two datasets and are dependent on the type of geometry. For example, with point features you can make distance measurements or ask what points are spatially inside polygon objects. Polygon features, on the other hand, allow for a wider range of both measurement and spatial relationship queries.

An important distinction between these two types of queries is that measurement queries depend on the CRS of the data while spatial relationship queries do not. This is because topological relationships, the term used to describe spatial relationships, are invariant to rotation, translation and scaling transformations like those that CRS transformations entail.

Attribute vs. Spatial Queries

We already know how to do attribute queries with our data. For example, we can select one or more specific counties by name or select those counties where the total population is greater than 100,000 because we have these columns in the dataset.

Spatial queries are special because they are dynamic. For example, we can compute area from the geometry without it already being encoded or we can select BART stations in Berkeley even if city is not encoded in the BART data by linking those two spatial datasets in the same geographic space. This dynamic query capability is extremely powerful!

In this lesson we’ll work through examples of each of those types of queries.

Then we’ll try a very common spatial analysis method that is a conceptual amalgam of those two types: proximity analysis.

6.0 Load and prep the data

Load the libraries we will use.

library(sf)
library(tmap)
library(here)

Read in the CA census tracts data and then take a look at its geometry and attributes.

census_tracts = st_read(here("notebook_data",
                             "census",
                             "Tracts",
                             "cb_2019_06_tract_500k.shp"))
## Reading layer `cb_2019_06_tract_500k' from data source 
##   `C:\Users\kathe\berkeley_drive\analyses\dlab\Geospatial-Fundamentals-in-R-with-sf\notebook_data\census\Tracts\cb_2019_06_tract_500k.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 8041 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.4096 ymin: 32.53416 xmax: -114.1312 ymax: 42.00948
## Geodetic CRS:  NAD83
plot(census_tracts$geometry)

head(census_tracts)
## Simple feature collection with 6 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.327 ymin: 33.83556 xmax: -117.9535 ymax: 37.94975
## Geodetic CRS:  NAD83
##   STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID    NAME LSAD
## 1      06      013  370000 1400000US06013370000 06013370000    3700   CT
## 2      06      001  442301 1400000US06001442301 06001442301 4423.01   CT
## 3      06      037  405101 1400000US06037405101 06037405101 4051.01   CT
## 4      06      037  199800 1400000US06037199800 06037199800    1998   CT
## 5      06      037  291300 1400000US06037291300 06037291300    2913   CT
## 6      06      037  292000 1400000US06037292000 06037292000    2920   CT
##     ALAND AWATER                       geometry
## 1  999356      0 MULTIPOLYGON (((-122.327 37...
## 2 1429049      0 MULTIPOLYGON (((-121.9701 3...
## 3 1121229      0 MULTIPOLYGON (((-117.9693 3...
## 4  651258      0 MULTIPOLYGON (((-118.2156 3...
## 5 2353751  45836 MULTIPOLYGON (((-118.3091 3...
## 6 4000998   1856 MULTIPOLYGON (((-118.3091 3...

Select just the Alameda County census tracts.

census_tracts_ac = census_tracts[census_tracts$COUNTYFP=='001',]
plot(census_tracts_ac)

6.1 Measurement Queries

We’ll start off with some simple measurement queries.

We can get the area of each of our census tracts using thesf function st_area.

st_area(census_tracts_ac)[1:10]
## Units: [m^2]
##  [1] 1433707.1  860583.5  485411.1  337395.8  367477.3 4350755.5  344063.1
##  [8] 3077954.4 1029301.4  485466.0

Okay!

We got…

numbers!

…?

Question

  1. What do those numbers mean?
  2. What are the units?
  3. And if we’re not sure, how might be find out?

Let’s take a look at our CRS.

st_crs(census_tracts_ac)
## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]

Wow! We’re working with data that are in what is called an unprojected CRS. That means that the coordinates are latitude and longitude values and the units are decimal degrees. However, the sf::st_area function automatically returned area measurements in square meters (rather than in square degrees, which don’t make sense.)

How did it do this? Well, you can check out the help documentation for ?st_area for more information. If the data have a projected CRS, st_area uses Euclidean geometry to return area measurements in the units of the CRS. For an unprojected CRS, st_area calculates geodetic area on a curved surface model of the Earth and returns measurements in square meters. Pretty cool and pretty useful right?


That said, when doing spatial analysis, we will almost always want to convert all of our data to the same projected CRS since many spatial operations do not work with geographic CRSs.

Time to project! We’ll transform the data to the UTM Zone 10N, NAD83 CRS (EPSG 26910) which is appropriate for Northern California location data and is highly accurate for measurement queries for areas within the zone.

#Transform CRS of census tract data
census_tracts_ac_utm10 = st_transform(census_tracts_ac, 26910)

Now check it..especially look for the units.

st_crs(census_tracts_ac_utm10)
## Coordinate Reference System:
##   User input: EPSG:26910 
##   wkt:
## PROJCRS["NAD83 / UTM zone 10N",
##     BASEGEOGCRS["NAD83",
##         DATUM["North American Datum 1983",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4269]],
##     CONVERSION["UTM zone 10N",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",0,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",-123,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",0.9996,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",500000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["(E)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["(N)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["Engineering survey, topographic mapping."],
##         AREA["North America - between 126°W and 120°W - onshore and offshore. Canada - British Columbia; Northwest Territories; Yukon. United States (USA) - California; Oregon; Washington."],
##         BBOX[30.54,-126,81.8,-119.99]],
##     ID["EPSG",26910]]

Now let’s try our area calculation again.

st_area(census_tracts_ac_utm10)[1:10]
## Units: [m^2]
##  [1] 1433564.7  860479.2  485354.4  337347.3  367423.6 4350606.4  344013.5
##  [8] 3077827.6 1029173.7  485399.5

What if we compare areas calculated with our unprojected and projected CRS data?

# Using format to make for easier to read display
print(format(st_area(census_tracts_ac)[[1]], big.mark = ','))
## [1] "1,433,707 [m^2]"
print(format(st_area(census_tracts_ac_utm10)[[1]], big.mark = ","))
## [1] "1,433,565 [m^2]"

Hmmm… The numbers are a bit different…specifically…

format((st_area(census_tracts_ac)[[1]] - st_area(census_tracts_ac_utm10)[[1]]),
       digits = 0, 
       big.mark = ',')
## [1] "142 [m^2]"

You may have noticed that our census tracts already have an area column in them.

Let’s also compare the calculated areas with the data in this column.

print(st_area(census_tracts_ac)[[1]])
## 1433707 [m^2]
print(st_area(census_tracts_ac_utm10)[[1]])
## 1433565 [m^2]
print(census_tracts$ALAND[1])
## [1] 999356

Question

What explains the discrepancy? Which areas are correct? Which are incorrect?

Doing more

We can also calculate the area for Alameda county summing the areas of all census tracts.

sum(st_area(census_tracts_ac_utm10))
## 1943203551 [m^2]

We can look up how large Alameda County is to check our work.The county is 739 miles2, which is around 1,914,001,213 meters2. I’d say we’re pretty close!

# Sum the area of all Alameda county Census Tracts dynamically in square miles...
sum(units::set_units(st_area(census_tracts_ac_utm10), mi^2))
## 750.2751 [mi^2]

Calculating the area of all features and adding the output to the dataframe is a useful operation because it allows us to convert count variables like population to densities.

# Add the area of all Alameda County Census tracts to the data frame
census_tracts_ac_utm10$area_sqmi <- units::set_units(st_area(census_tracts_ac_utm10), mi^2)

# Check it by summing
print(sum(census_tracts_ac_utm10$area_sqmi))
## 750.2751 [mi^2]
# Take a look
head(census_tracts_ac_utm10,3)
## Simple feature collection with 3 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 579937.1 ymin: 4153335 xmax: 592636.3 ymax: 4167206
## Projected CRS: NAD83 / UTM zone 10N
##     STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID    NAME LSAD
## 2        06      001  442301 1400000US06001442301 06001442301 4423.01   CT
## 73       06      001  437400 1400000US06001437400 06001437400    4374   CT
## 114      06      001  437701 1400000US06001437701 06001437701 4377.01   CT
##       ALAND AWATER                       geometry        area_sqmi
## 2   1429049      0 MULTIPOLYGON (((590997 4154... 0.5535024 [mi^2]
## 73   858407      0 MULTIPOLYGON (((580052.1 41... 0.3322329 [mi^2]
## 114  484105      0 MULTIPOLYGON (((581636.7 41... 0.1873964 [mi^2]

You may be wondering how R is managing the units of our measurements.

It turns out that sf depends on the units package to track units.

This is super convenient! But there is a gotcha:

# convert to square kilometers
sum(st_area(census_tracts_ac_utm10)) / (1000^2)
## 1943.204 [m^2]

Oops! Our manual conversion to square kilometers gave us the right number but kept the now-wrong units!

Here’s the proper way to convert:

units::set_units(sum(st_area(census_tracts_ac_utm10)), km^2)
## 1943.204 [km^2]

Much nicer! In case you’re wondering how we knew the right abbreviation to use for kilometers, check out the leftmost column in this reference table:

# View(units::valid_udunits())

Calculating Length or Permeter

We can use the st_length operator in the same way to calculate the length or perimeter of features. Always take note of the output units!

st_length(census_tracts)[1:10]
## Units: [m]
##  [1] 0 0 0 0 0 0 0 0 0 0

Calculating Distance

The st_distance function can be used to find the pairwise distance between two sets of geometries.

st_distance(census_tracts_ac_utm10, census_tracts_ac_utm10)[1:5,1:5]
## Units: [m]
##          [,1]       [,2]       [,3]     [,4]     [,5]
## [1,]     0.00 15426.9778 14079.4434 41190.08 40789.89
## [2,] 15426.98     0.0000   686.6722 24396.80 24010.87
## [3,] 14079.44   686.6722     0.0000 26033.64 25670.59
## [4,] 41190.08 24396.8018 26033.6355     0.00     0.00
## [5,] 40789.89 24010.8725 25670.5915     0.00     0.00

You can also use it to find the distance between specific features.

# Identify my tracts of interest
mytracts = c('4201','4202','4203','4204')
# What is the distance between tract 4201 and all other tracts
st_distance(census_tracts_ac_utm10[census_tracts_ac_utm10$NAME=='4101',],
            census_tracts_ac_utm10[census_tracts_ac_utm10$NAME %in% mytracts,] )
## Units: [m]
##          [,1]     [,2]     [,3]     [,4]
## [1,] 19220.14 18822.72 19263.81 18946.69

6.2 Spatial Relationship Queries

Spatial relationship queries consider how two geometries or sets of geometries relate to one another in space. For example, you may want to know what schools are located within the City of Berkeley or what East Bay Regional Parks have land within Berkeley. You may also want to combine a measurement query with a spatial relationship query. Example, you may want to know the total length of freeways within the city of Berkeley.

Here is a list of some of the more commonly used sf spatial relationship operations.

  • st_intersects
  • st_within
  • st_contains
  • st_disjoint

These can be used to select features in one dataset based on their spatial relationship to another. In other works, you can use these operations to make spatial selections / create spatial subsets.

Enough talk. Let’s work through some examples.

What Alameda County Schools are in Berkeley?

First, load the CA Places data and select the city of Berkeley and save it to a sf dataframe.

places = st_read(here("notebook_data",
                      "census",
                      "Places",
                      "cb_2019_06_place_500k.shp"))
## Reading layer `cb_2019_06_place_500k' from data source 
##   `C:\Users\kathe\berkeley_drive\analyses\dlab\Geospatial-Fundamentals-in-R-with-sf\notebook_data\census\Places\cb_2019_06_place_500k.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1521 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.2694 ymin: 32.53416 xmax: -114.229 ymax: 41.99314
## Geodetic CRS:  NAD83
berkeley = places[places$NAME=='Berkeley',]

plot(berkeley$geometry)

Then, load the Alameda County schools data and make it a spatial dataframe.

schools_df = read.csv(here("notebook_data",
                           "alco_schools.csv"))

schools_sf = st_as_sf(schools_df, 
                      coords = c('X','Y'),
                      crs = 4326)

Check that the two sf dataframes have the same CRS.

st_crs(schools_sf) == st_crs(berkeley)
## [1] FALSE

They don’t have the same CRS so we need to align them. Let’s transform (or reproject) the CRS of both of these dataframes to UTM10N, NAD83 (EPSG 26910). This is a commonly used CRS for Northern CA data.

# Transform data CRSes
schools_utm10 <- st_transform(schools_sf, 26910)
berkeley_utm10 <- st_transform(berkeley, 26910)

If you look at the Schools data you will see that it has a City column. So we can subset the data by attribute to select the Schools in Berkeley. No need to do a spatial selection.

berkeley_schools = schools_utm10[schools_utm10$City == 'Berkeley', ]
dim(berkeley_schools)
## [1] 31  8

Confirm the results by plotting the data

plot(berkeley_utm10$geometry)
plot(berkeley_schools$geometry, add = T)

That looks good and was a relatively simple operation. But what if the schools data didn’t have that city column or if only some of the rows had a value in that column. How can we identify the schools in Berkeley spatially?

Here’s how!

# SPATIALLY select only the schools within Berkeley
berkeley_schools_spatial = schools_utm10[berkeley_utm10, , op = st_intersects]  # NO QUOTES!!!

Yes that was it! Take a long look at that simple yet powerful spatial selection syntax.

You should interpret that syntax as:

  • "Select all features (i.e. rows) in the schools_utm10 dataframe:

    • schools_utm10[berkeley_utm10, , op=st_intersects]
  • and all of the columns:

    • schools_utm10[berkeley_utm10 , , op=st_intersects]

    (all because the extraction brackets have no second argument)

  • whose geometry spatially intersects the Berkeley_utm10 geometry

    • schools_utm10[berkeley_utm10, , op=st_intersects]
Important

The op=st_intersects argument is optional because st_intersects is the default spatial selector.

To emphasize this, let’s rerun the last command.

# SPATIALLY select only the schools within Berkeley
berkeley_schools_spatial = schools_utm10[berkeley_utm10, ]

What does spatiallly intersects mean?

Here’s one way to explain it.

Geometry A spatially intersects Geometry B if any of its parts (e.g., a point, line segment, or polygon) is equivalent to, touches, crosses, is contained by, contains, or overlaps any part of Geometry B.

So st_intersects is the mother of all spatial relationships! It is the most general and the most useful. However, you can specify any of those more specific spatial relationships by setting op= to any of the options listed in the ?st_intersects? help documentation.

Let’s check out the sf object that our selection returned.

# How many schools did we get
dim(berkeley_schools_spatial)
## [1] 32  8
# Map the results
plot(berkeley_utm10$geometry)
plot(berkeley_schools_spatial$geometry, add = T)

Interestingly, we have one more school in Berkeley based on the spatial selection!? Let’s take a look.

plot(berkeley_utm10$geometry)
plot(berkeley_schools_spatial$geometry, add = T)
plot(berkeley_schools$geometry, col="red", add = T)

Let’s use an interactive tmap to zoom into the school that was not selected by attribute but was spatially selected.

tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(berkeley_utm10) +
  tm_borders() +
  tm_shape(berkeley_schools_spatial) +
  tm_dots(col = "black", size = 0.3) +
  tm_shape(berkeley_schools) +
  tm_dots(col = "red", size = 0.1)

IMPORTANT: The default spatial selection operator is st_intersects. If you want to use any other operator, it must be specified.

For example, we can use the st_disjoint operator to select only those schools NOT in Berkeley.

# Select all Alameda County Schools NOT in Berkeley with the disjoint operator
berkeley_schools_disjoint = schools_utm10[berkeley_utm10, , op = st_disjoint]

# Plot the result
plot(berkeley_schools_disjoint$geometry)
plot(berkeley_utm10, 
     col = NA, 
     border = "red", 
     add = T)
## Warning in plot.sf(berkeley_utm10, col = NA, border = "red", add = T): ignoring
## all but the first attribute

There is no need to memorize these spatial operators (aka predicates)! Here is a fantastic sf cheatsheet that lists and briefly explains all these common functions (and many more).


Protected Areas in Alameda County

Let’s load a new dataset, the CA Protected Areas Database (CPAD), to demonstrate these spatial queries in more detail.

This dataset contains all of the protected areas (parks and the like) in California.

We will use this data and the Alameda County census tract data that we created earlier to ask, “What protected areas are within Alameda County?”

First load the CPAD data.

cpad <- st_read(here("notebook_data",
                     "protected_areas",
                     "CPAD_2020a_Units.shp"))
## Reading layer `CPAD_2020a_Units' from data source 
##   `C:\Users\kathe\berkeley_drive\analyses\dlab\Geospatial-Fundamentals-in-R-with-sf\notebook_data\protected_areas\CPAD_2020a_Units.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 17068 features and 21 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -374984.2 ymin: -604454.8 xmax: 540016.3 ymax: 449743.2
## Projected CRS: NAD83 / California Albers

What is the CRS of the CPAD data?

Let’s transform the data to match census_tracts_ac_utm10.

cpad_utm10 <- st_transform(cpad, st_crs(census_tracts_ac_utm10))

Let’s plot the data in so that we know what to expect. CPAD is big, so wait for it…

plot(census_tracts_ac_utm10$geometry, col = 'grey', border = "grey")
plot(cpad_utm10$geometry, col = 'green', add = T)

We can see from our map that some of the protected areas are completely within Alameda County, some of them overlap, and some are completely outside of the county. To get both of the “inside” and “overlaps” cases we use the st_intersects spatial selection operator, which is the default. Let’s check it out.

cpad_intersects <- cpad_utm10[census_tracts_ac_utm10,]  # st_intersects
cpad_within <- cpad_utm10[census_tracts_ac_utm10, , op = st_within] # st_within

We can use tmap to explore the difference in the results from st_intersects vs st_within

tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(census_tracts_ac_utm10) +
  tm_polygons(col = "gray", 
              border.col = "grey") +
tm_shape(cpad_intersects) +
  tm_borders(col = "green") +
tm_shape(cpad_within) +
  tm_borders(col = 'red')

What you can see from the above, is that by default, st_intersects returns the features that intersect but it does not clip the features to the boundary of Alameda County. For that, we would need to use a different spatial operation - st_intersection.

Let’s try it!

cpad_in_ac <- st_intersection(cpad_utm10, census_tracts_ac_utm10)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries

Great! Now, if we scroll the resulting sf object we’ll see that the COUNTY column of our resulting subset gives us a good sanity check on our results. Or does it?

table(cpad_in_ac$COUNTY)
## 
##      Alameda Contra Costa  San Joaquin  Santa Clara 
##          837           20            2            4

Always check your output - both the attribute table & the geometry!

head(cpad_in_ac)
## Simple feature collection with 6 features and 31 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 562862.8 ymin: 4166288 xmax: 598493.1 ymax: 4184898
## Projected CRS: NAD83 / UTM zone 10N
##       ACCESS_TYP UNIT_ID        UNIT_NAME SUID_NMA AGNCY_ID
## 8992 Open Access   14846     Schafer Park    25179     2045
## 9491 Open Access   15886    Eden Greenway    18427     2045
## 2479 Open Access   17473      Lowell Park    21557     1228
## 9186 Open Access   15251 Owens Plaza Park    23276     1257
## 9194 Open Access   15260   Creekside Park    17777     1257
## 1778 Open Access   13264 Jefferson Square    20383     1228
##                                     AGNCY_NAME        AGNCY_LEV
## 8992 Hayward Area Recreation and Park District Special District
## 9491 Hayward Area Recreation and Park District Special District
## 2479                          Oakland, City of             City
## 9186                       Pleasanton, City of             City
## 9194                       Pleasanton, City of             City
## 1778                          Oakland, City of             City
##                      AGNCY_TYP
## 8992 Recreation/Parks District
## 9491 Recreation/Parks District
## 2479               City Agency
## 9186               City Agency
## 9194               City Agency
## 1778               City Agency
##                                                  AGNCY_WEB            LAYER
## 8992                            http://www.haywardrec.org/ Special District
## 9491                            http://www.haywardrec.org/ Special District
## 2479 http://www2.oaklandnet.com/Government/o/opr/index.htm             City
## 9186                    http://www.cityofpleasantonca.gov/             City
## 9194                    http://www.cityofpleasantonca.gov/             City
## 1778 http://www2.oaklandnet.com/Government/o/opr/index.htm             City
##      MNG_AG_ID                                MNG_AGENCY       MNG_AG_LEV
## 8992      2045 Hayward Area Recreation and Park District Special District
## 9491      2045 Hayward Area Recreation and Park District Special District
## 2479      1228                          Oakland, City of             City
## 9186      1257                       Pleasanton, City of             City
## 9194      1257                       Pleasanton, City of             City
## 1778      1228                          Oakland, City of             City
##                     MNG_AG_TYP PARK_URL  COUNTY  ACRES       LABEL_NAME YR_EST
## 8992 Recreation/Parks District     <NA> Alameda  1.748     Schafer Park      0
## 9491 Recreation/Parks District     <NA> Alameda 54.736    Eden Greenway   1970
## 2479               City Agency     <NA> Alameda  8.710      Lowell Park      0
## 9186               City Agency     <NA> Alameda  2.554 Owens Plaza Park      0
## 9194               City Agency     <NA> Alameda  5.719   Creekside Park   2000
## 1778               City Agency     <NA> Alameda  1.376 Jefferson Square      0
##                     DES_TP GAP_STS STATEFP COUNTYFP TRACTCE
## 8992            Local Park       4      06      001  437400
## 9491 Local Recreation Area       4      06      001  437400
## 2479            Local Park       4      06      001  402500
## 9186            Local Park       4      06      001  450743
## 9194            Local Park       4      06      001  450743
## 1778            Local Park       4      06      001  403100
##                  AFFGEOID       GEOID    NAME LSAD   ALAND AWATER
## 8992 1400000US06001437400 06001437400    4374   CT  858407      0
## 9491 1400000US06001437400 06001437400    4374   CT  858407      0
## 2479 1400000US06001402500 06001402500    4025   CT  365454      0
## 9186 1400000US06001450743 06001450743 4507.43   CT 4333270      0
## 9194 1400000US06001450743 06001450743 4507.43   CT 4333270      0
## 1778 1400000US06001403100 06001403100    4031   CT  346852      0
##             area_sqmi                       geometry
## 8992 0.3322329 [mi^2] POLYGON ((580615.7 4166758,...
## 9491 0.3322329 [mi^2] POLYGON ((580206.2 4166317,...
## 2479 0.1418630 [mi^2] POLYGON ((562897.3 4184898,...
## 9186 1.6797785 [mi^2] POLYGON ((598045.1 4172639,...
## 9194 1.6797785 [mi^2] POLYGON ((598289.7 4172276,...
## 1778 0.1328244 [mi^2] POLYGON ((563481.4 4184006,...

Let’s also use an overlay plot to check the output geometry.

tm_shape(census_tracts_ac_utm10) + 
  tm_polygons(col = 'gray', 
              border.col = 'gray') +
tm_shape(cpad_in_ac) + 
  tm_polygons(col = 'ACRES', 
              palette = 'YlGn',
              border.col = 'black', 
              lwd = 0.4, 
              alpha = 0.8,
              title =  'Protected areas in Alameda County, colored by area')

st_intersects or st_intersection?

It really depends! But make sure you understand the difference.

st_intersects is a logical operator that returns True if two geometries intersect in any way. When used to subset (or filter) a spatial dataframe, st_intersects returns those features in the dataframe that intersect with the filter dataframe.

On the other hand, st_intersection returns a new spatial dataframe that set intersection of the two dataframes, including both the geometries and attributes of the intersecting features. Use st_intersection with caution and always check your results.

Exercise: Spatial Relationship Query

It’s your turn.

Write a spatial relationship query to create a new dataset containing only the BART stations in Berkeley.

Run the next two cells to (1) load the dataset containing Berkeley BART stations and then reproject it to the same CRS as that used by the Berkeley_utm10 dataframe (EPSG: 26910)’ (2) plot these two datasets in an overlay map.

Then, write your own code to: 1. Spatially select the BART stations that are within Berkeley 2. Plot the Berkeley boundary and then overlay the selected BART stations.

# load the Berkeley boundary
bart_df = st_read(here("notebook_data",
                       "transportation",
                       "bart.csv"))
## Reading layer `bart' from data source 
##   `C:\Users\kathe\berkeley_drive\analyses\dlab\Geospatial-Fundamentals-in-R-with-sf\notebook_data\transportation\bart.csv' 
##   using driver `CSV'
## Warning: no simple feature geometries present: returning a data.frame or tbl_df
bart_sf = st_as_sf(bart_df, 
                   coords = c('lon', 'lat'),
                   crs = 4326)
  
# transform to EPSG:26910
bart_utm10 = st_transform(bart_sf, st_crs(berkeley_utm10))

# display
head(berkeley_utm10)
## Simple feature collection with 1 feature and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 559347.6 ymin: 4188961 xmax: 567371.4 ymax: 4195617
## Projected CRS: NAD83 / UTM zone 10N
##     STATEFP PLACEFP  PLACENS         AFFGEOID   GEOID     NAME LSAD    ALAND
## 740      06   06000 02409837 1600000US0606000 0606000 Berkeley   25 27127434
##       AWATER                       geometry
## 740 18715615 MULTIPOLYGON (((559347.6 41...

Plot the data together

plot(bart_utm10$geometry)
plot(berkeley_utm10$geometry, border = 'blue', add = T)

Your code here!

Now, in the cell below, write the code to spatially select the Berkeley BART stations, then make the map.

# YOUR CODE HERE:

# Spatially select the BART stations within Berkeley

# Plot the Bart stations in Berkeley overlaid on top of the Berkeley City Boundary

Solution hidden here!

To see it, right-click and select “inspect element” in your browser (or look in the 06_Spatial_Queries.Rmd file, line 599).


6.3 Proximity Analysis

Now that we’ve seen the basic idea of spatial measurement and spatial relationship queries, let’s take a look at a common analysis that combines those concepts: promximity analysis.

Proximity analysis seeks to identify near features - or, in other words, all features in a focal feature set that are within some maximum distance of features in a reference feature set.

A very common workflow for this analysis is:

  1. Buffer around the features in the reference dataset to create buffer polygons.

  2. Run a spatial relationship query to find all focal features that intersect (or are within) the buffer polygons.


Let’s read in our bike boulevard data again.

Then we’ll find out which of our Berkeley schools are within a block’s distance (200 meters) of the bike boulevards.

bike_blvds <- st_read(here("notebook_data",
                           "transportation",
                           "BerkeleyBikeBlvds.geojson"))
## Reading layer `BerkeleyBikeBlvds' from data source 
##   `C:\Users\kathe\berkeley_drive\analyses\dlab\Geospatial-Fundamentals-in-R-with-sf\notebook_data\transportation\BerkeleyBikeBlvds.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 211 features and 10 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: 561541.2 ymin: 4189007 xmax: 566451.6 ymax: 4193483
## Projected CRS: WGS 84 / UTM zone 10N
plot(bike_blvds$geometry)

Of course, we need to reproject the boulevards to our projected CRS.

bike_blvds_utm10 = st_transform(bike_blvds, st_crs(berkeley_utm10))

Now we can create our 200 meter bike boulevard buffers.

bike_blvds_buf = st_buffer(bike_blvds_utm10, dist = 200)

Now let’s overlay everything.

tm_shape(berkeley_utm10) + 
  tm_polygons(col = 'lightgrey') + 
tm_shape(bike_blvds_buf) + 
  tm_polygons(col = 'pink', alpha = 0.5) +
tm_shape(bike_blvds_utm10) + 
  tm_lines() + 
tm_shape(berkeley_schools_spatial) + 
  tm_dots(col = 'purple', size = 0.2)

Great! Looks like we’re all ready to run our spatial relationship query to complete the proximity analysis. At this point (pun intended) select the schools that are in within the bike boulevard buffer polygons.

schools_near_blvds = berkeley_schools_spatial[bike_blvds_buf, ]

# or
#schools_near_blvds = berkeley_schools_spatial[bike_blvds_buf, , op = 'st_within']

Now let’s overlay again, to see if the schools we selected make sense.

tm_shape(berkeley_utm10) + 
  tm_polygons(col = 'lightgrey') + 
  
# add the bike blvd buffer polygons  
tm_shape(bike_blvds_buf) + 
  tm_polygons(col = 'pink', alpha = 0.5) +

# add the bike blvd line features  
tm_shape(bike_blvds_utm10) + 
  tm_lines() + 

# add all berkeley schools  
tm_shape(berkeley_schools_spatial) + 
  tm_dots(col = 'purple', size = 0.2) +

# add schools near bike boulevards in yellow
tm_shape(schools_near_blvds) + 
  tm_dots(col = 'yellow', size = 0.2)

Leveling up!

You can use st_distance and its companion function st_nearest_feature to compute the distance between each feature and the nearest bike boulevard. The st_nearest_feature function returns the ID of the closest feature.

# Identify the nearest bike boulevard for each school
nearest = st_nearest_feature(berkeley_schools_spatial, bike_blvds_utm10)

# take a look!
nearest
##  [1]  71 171  39 136  42  30 163 172 127 171 189 156 137  33 187 197  50 207 169
## [20] 102 125 137   3 109 207  76 135 173  56 102 146  76

Then we can calculate the distance between each school and it’s nearest bike boulevard.

st_distance(berkeley_schools_spatial, 
            bike_blvds_utm10[nearest,], 
            by_element = TRUE)
## Units: [m]
##  [1]  13.848161 985.459488 309.889446 369.402946 196.011379  15.332395
##  [7]  27.250406 439.758905 107.902846 926.216792 193.030072 181.836256
## [13] 373.736477 215.903128 184.321307 186.446907   1.288383  94.064527
## [19] 211.477566 218.613628 186.913116 230.212129  15.162313 188.829602
## [25] 232.764113 224.700672 173.920971  15.892361 514.765384  92.556921
## [31]  93.426741 128.131187

Exercise: Proximity Analysis

Now it’s your turn to try out a proximity analysis!

Write your own code to find all BERKELEY schools within walking distance (1 km) of a BART station.

As a reminder, let’s break this into steps:

  1. Spatially select the BART stations in Berkeley from the bart_utm10 dataframe
  2. Buffer your Berkeley BART stations to 1 km (HINT: remember your units!)
  3. Spatially select the schools within walking distance to the Berkeley BART stations.
  4. As always, plot your results for a good visual check!

To see the solution, look at the hidden text below.

Your code here

# YOUR CODE HERE:

# spatially select the Berkeley BART stations 
# # (You may have done this in a previous exercise.)

# buffer the BART stations to 1 km

# spatially select the schools within the buffers

# map your results with tmap
# # plot the Berkeley boundary (for reference) in lightgrey

# add the BART stations (for reference) to the plot in green

# add the BART buffers (for check) in lightgreen

# add all Berkeley schools (for reference) in black

# add the schools near BART (for check) in yellow

Solution hidden here!

To see it, right-click and select “inspect element” in your browser (or look in the 06_Spatial_Queries.Rmd file, line 770).

Bonus Exercise

Compute the distance between each Berkeley school and its nearest BART station!

# YOUR CODE HERE:

6.4 Recap

Leveraging what we’ve learned in our earlier lessons, we got to work with map overlays and start answering questions related to proximity. Key concepts include:

  • Measuring area and length
    • st_area,
    • st_length
    • st_distance
  • Spatial relationship queries
    • st_intersects,
    • st_intersection
    • st_within, etc.
  • Buffer analysis
    • st_buffer

 D-Lab @ University of California - Berkeley
 Team Geo